Introduction to variantannotation: package for the aggregation of genomic variant data

Author: C. Mazzaferro & Kathleen Fisch

Email: cmazzafe@ucsd.edu

Date: June 2016

Outline of Notebook

  1. Background
  2. Set Up File and Libraries
  3. Run Annovar
  4. Obtain data from myvariant.info
  5. Variant Filtering & File Creation
  6. Export CSV and VCF files

Background

This notebook will walk you through the steps of how variants coming from a VCF can be annotated efficiently and thoroughly using the package variantannotation. In particular, the package is aimed at providing a way of retrieving variant information using ANNOVAR and myvariant.info and consolidating it in conveninent formats. It is well-suited for bioinformaticians interested in aggregating variant information into a single database for ease of use and to provide higher analysis capabities.

The aggregation is performed specifically by structuring the data in lists of python dictionaries, with each variant being described by a multi-level dictionary. The choice was made due to the inconsistencies that exist between data availability, and the necessity to store their information in a flexible manner. Further, this specific format permits its parsing to a MongoDb instance (dictionaries are the python representation of JSON objects), which enables the user to efficiently store, query and filter such data.

Finally, the package also has the added functionality to create csv and vcf files from MongoDB. The class Filters allows the user to rapidly query data that meets certain criteria as a list of documents, and the class FileWriter can transform such list into more widely accepted formats such as vcf and csv files. It should be noted that here, the main differential the package offers is the ability to write these files preserving all the annotation data. In the vcf files, for instance, outputs will have a 'Otherinfo' column where all the data coming from ANNOVAR and myvariant.info is condensed (while still preserving its structure). For vcf files, outputs will have around ~120-200 columns, depending on the amount of variant data that can be retrieved from myvvariant.info.

Notes on required software

the following libraries will be install upon installing variantannotation:

  • myvariant
  • pysam
  • pymongo
  • pyvcf

Other libraries that are needed, but should natively e installed on most OS:

  • Watchdog
  • Pandas
  • Numpy

Further, a MongoDB database must be set up. Refer to the documentation page for more information. Similarly, ANNOVAR must be downloaded, alongside with its supporting databases (also listed on the documentation page).

Import libraries and specify file paths


In [19]:
os.getcwd()


Out[19]:
'/Users/carlomazzaferro/Documents/CCBB/jupyter-genomics/notebooks/dnaSeq/VAPr_Variant_Annotation_Prioritization'

In [22]:
import os
import re
import sys
import vcf
import time
import pysam
import myvariant
import collections
import numpy as np
import pandas as pd
sys.path.append(os.getcwd().replace("notebooks/dnaSeq/VAPr_Variant_Annotation_Prioritization", "src/dnaSeq/VAPr"))

#variantannotation functions
from variantannotation import annotate_batch
from variantannotation import create_output_files
from variantannotation import myvariant_parsing_utils
from variantannotation import mongo_DB_export
from variantannotation import utilities
from variantannotation import MongoDB_querying

In [2]:
ANNOVAR_PATH = '/data/annovar/'
FILE_NAMES = ['Tumor_RNAseq_variants.vcf', 'Tumor_targeted_seq.vcf', 'normal_targeted_seq.vcf', 'normal_blood_WGS.vqsr.vcf', 'somatic_mutect_old.vcf']
IN_PATH = '/data/ccbb_internal/interns/Carlo/test_vcf/'
OUT_PATH = '/data/ccbb_internal/interns/Carlo/test_vcf_out/'
vcf_file = IN_PATH

In [3]:
#Check if file paths are correctly pointing to the specified files.
for i in range(0, len(FILE_NAMES)):
    print IN_PATH+FILE_NAMES[i]


/data/ccbb_internal/interns/Carlo/test_vcf/Tumor_RNAseq_variants.vcf
/data/ccbb_internal/interns/Carlo/test_vcf/Tumor_targeted_seq.vcf
/data/ccbb_internal/interns/Carlo/test_vcf/normal_targeted_seq.vcf
/data/ccbb_internal/interns/Carlo/test_vcf/normal_blood_WGS.vqsr.vcf
/data/ccbb_internal/interns/Carlo/test_vcf/somatic_mutect_old.vcf

Run Annovar

This will run ANNOVAR. A csv file named tumortargcsvout.hg19_multianno.csv will appear in the OUT_PATH specified. The csv file can then be processed and integrated with the data coming from myvariant.info. This command may take a some time to run (5-30 minutes for each file depending on file size). To keep things simple, we can start by looking at one file only. Let's run annovar on it. In any case, if you have multiple files to work on, you can run them in parallel by running the block after the next one.


In [ ]:
utilities.run_annovar(ANNOVAR_PATH, IN_PATH+FILE_NAMES[0], OUT_PATH)


Currently working on VCF file: Tumor_RNAseq_variants, field avinput
Currently working on VCF file: Tumor_RNAseq_variants, field variant_function
Currently working on VCF file: Tumor_RNAseq_variants, field exonic_variant_function
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_tfbsConsSites
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cytoBand
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_targetScanS
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_genomicSuperDups
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_gwasCatalog
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_esp6500siv2_all_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field 2015_08_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field 2015_08_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_snp138_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_snp138_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_ljb26_all_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_ljb26_all_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cg46_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cg46_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cg69_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cg69_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_popfreq_all_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_popfreq_all_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_clinvar_20140929_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_clinvar_20140929_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cosmic70_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_cosmic70_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_nci60_filtered
Currently working on VCF file: Tumor_RNAseq_variants, field hg19_nci60_dropped
Currently working on VCF file: Tumor_RNAseq_variants, field csv

Annovar finished working on file : Tumor_RNAseq_variants has finished. A .csv file has been created in the 
 OUT_PATH directory
Currently working on VCF file: Tumor_targeted_seq, field avinput
Currently working on VCF file: Tumor_targeted_seq, field variant_function
Currently working on VCF file: Tumor_targeted_seq, field exonic_variant_function
Currently working on VCF file: Tumor_targeted_seq, field hg19_tfbsConsSites
Currently working on VCF file: Tumor_targeted_seq, field hg19_cytoBand
Currently working on VCF file: Tumor_targeted_seq, field hg19_targetScanS
Currently working on VCF file: Tumor_targeted_seq, field hg19_genomicSuperDups
Currently working on VCF file: Tumor_targeted_seq, field hg19_gwasCatalog
Currently working on VCF file: Tumor_targeted_seq, field hg19_esp6500siv2_all_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: Tumor_targeted_seq, field 2015_08_filtered
Currently working on VCF file: Tumor_targeted_seq, field 2015_08_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_snp138_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_snp138_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_ljb26_all_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_ljb26_all_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_cg46_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_cg46_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_cg69_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_cg69_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_popfreq_all_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_popfreq_all_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_clinvar_20140929_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_clinvar_20140929_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_cosmic70_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_cosmic70_dropped
Currently working on VCF file: Tumor_targeted_seq, field hg19_nci60_filtered
Currently working on VCF file: Tumor_targeted_seq, field hg19_nci60_dropped
Currently working on VCF file: Tumor_targeted_seq, field csv

Annovar finished working on file : Tumor_targeted_seq has finished. A .csv file has been created in the 
 OUT_PATH directory
Currently working on VCF file: normal_targeted_seq, field avinput
Currently working on VCF file: normal_targeted_seq, field variant_function
Currently working on VCF file: normal_targeted_seq, field exonic_variant_function
Currently working on VCF file: normal_targeted_seq, field hg19_tfbsConsSites
Currently working on VCF file: normal_targeted_seq, field hg19_cytoBand
Currently working on VCF file: normal_targeted_seq, field hg19_targetScanS
Currently working on VCF file: normal_targeted_seq, field hg19_genomicSuperDups
Currently working on VCF file: normal_targeted_seq, field hg19_gwasCatalog
Currently working on VCF file: normal_targeted_seq, field hg19_esp6500siv2_all_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: normal_targeted_seq, field 2015_08_filtered
Currently working on VCF file: normal_targeted_seq, field 2015_08_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_snp138_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_snp138_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_ljb26_all_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_ljb26_all_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_cg46_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_cg46_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_cg69_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_cg69_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_popfreq_all_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_popfreq_all_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_clinvar_20140929_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_clinvar_20140929_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_cosmic70_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_cosmic70_dropped
Currently working on VCF file: normal_targeted_seq, field hg19_nci60_filtered
Currently working on VCF file: normal_targeted_seq, field hg19_nci60_dropped
Currently working on VCF file: normal_targeted_seq, field csv

Annovar finished working on file : normal_targeted_seq has finished. A .csv file has been created in the 
 OUT_PATH directory
Currently working on VCF file: normal_blood_WGS, field avinput
Currently working on VCF file: normal_blood_WGS, field variant_function
Currently working on VCF file: normal_blood_WGS, field exonic_variant_function
Currently working on VCF file: normal_blood_WGS, field hg19_tfbsConsSites
Currently working on VCF file: normal_blood_WGS, field hg19_cytoBand
Currently working on VCF file: normal_blood_WGS, field hg19_targetScanS
Currently working on VCF file: normal_blood_WGS, field hg19_genomicSuperDups
Currently working on VCF file: normal_blood_WGS, field hg19_gwasCatalog
Currently working on VCF file: normal_blood_WGS, field hg19_esp6500siv2_all_filtered
Currently working on VCF file: normal_blood_WGS, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: normal_blood_WGS, field 2015_08_filtered
Currently working on VCF file: normal_blood_WGS, field 2015_08_dropped
Currently working on VCF file: normal_blood_WGS, field hg19_snp138_filtered
Currently working on VCF file: normal_blood_WGS, field hg19_snp138_dropped

In [ ]:
#Annovar runs as a subprocess on every file. They will run in parallel for speed up. 
for i in range(0, len(FILE_NAMES)):
    utilities.run_annovar(ANNOVAR_PATH, IN_PATH+FILE_NAMES[i], OUT_PATH)

#This serves to give a real-time feedback of the ANNOVAR progress and status.

Specify the name and location of the csv file that ANNOVAR produces as output


In [4]:
filepath_out = '/data/ccbb_internal/interns/Carlo/test_vcf_out/'
filepath_in = '/data/ccbb_internal/interns/Carlo/test_vcf/'

#For safety, check the files in directory. Either run '!ls' here on iPython, or go to the directory and check 
#manually for existing files. There should be once csv file for every vcf file. 

VCF_FILE_NAME = 'Tumor_RNAseq_variants.vcf'
CSV_FILE_NAME = 'Tumor_RNAseq_variants.hg19_multianno.csv'

Getting data from myvariant.info

The package offers 4 different methods to obtain variant data. Two of them require annovar, while the other two are based solely on the use of myvariant.info service. The latter can be used without having to worry about downloading and installing annovar databases, but it tends to return partial or no information for some variants.

The different methods also enable the user to decide how the data will be parsed to MongoDB. 1 and 3 parse the data by chunks: the user specifies a number of variants (usually 1000), and the data from the vcf and csv files are parsed as soon as those 1000 variants are processed and integrated. This enables huge files to be processed without having to hold them in memory and potentially cause a Memory Overflow error.

Methods 2 and 4, on the other hand, process the files on their entirety and send them to MongoDB at once. Well-suited for smaller files. See docs for more info.

Export data to MongoDB by chunks, iteratively.

For this tutorial, we will use method #1. Data from annovar (as a csv file) will be obtained 1000 lines at a time, instead of attempting to parse and process an entire csv file at once.

As soon as you run the scripts from variantannotaiton, variant data will be retrieved from myvariant.info and the data will automatically be integrated and stored to MongoDB. Database and collection name should be specified, and there must be a running MongoDB connection. The script will set up a client to communicate between python (through pymongo) and the the database.

In general, the shell command:

mongod --dbpath ../data/db

where data/db is the designated location where the data will be stored, will initiate MongoDB. After this, the script should store data to the directory automatically. For pymongo, and more information on how to set up a Mongo Database: https://docs.mongodb.com/getting-started/python/


In [17]:
chunksize = 10000
step = 0

#Get variant list. Should always be the first step after running ANNOVAR
open_file = myvariant_parsing_utils.VariantParsing()

#Name Collections & DB. Change them to something appropriate. Each file should live in a collection
db_name = 'Variant_Prioritization_Workflow'

collection_name = 'Test_Tumor_RNAseq'

list_file = open_file.get_variants_from_vcf(filepath_in+VCF_FILE_NAME)
as_batch = annotate_batch.AnnotationMethods()
as_batch.by_chunks(list_file, chunksize, step, filepath_out+CSV_FILE_NAME, collection_name, db_name)


Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 1 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 2 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 3 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 4 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 5 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 6 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 7 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 8 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 9 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 10 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 11 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 12 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 13 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 14 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 15 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 16 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 17 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 18 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 19 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 20 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 21 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 22 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 23 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 24 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 25 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 26 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 27 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 28 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 29 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 30 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 31 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 32 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 33 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 34 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 35 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 36 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 37 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 38 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 39 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 40 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 41 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
Joining lists ...
Parsing to MongoDB ...
Step: 42 of 43
Converting columns to float ...
Processing knownGene info ...
Processing tfbsConsSites info ...
Processing genomicSuperDups info ...
Processing cytoBand info ...
Creating hgvs key ...
Processing genotype call info ...
Transforming to JSON from dataFrame
cleaning up...
Done
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-2710...done.
Joining lists ...
Parsing to MongoDB ...
Step: 43 of 43
Out[17]:
'Finished!'

Variant Filtering & Output Files

Here we implement three different filters that allow for the retrieval of specific variants. The filters are implemented as MongoDB queries, and are designed to provie the user with a set of relevant variants. In case the user would like to define its own querying, a template is provided. The output of the queries is a list of dictionaries (JSON documents), where each dictionary contains data reltive to one variant.

Further, the package allows the user to parse these variants into an annotated csv or vcf file. If needed, annotated, unfiltered vcf and csv files can also be created. They will have the same length (number of variants) as the original files, but will contain much more complete annotation data coming from myvariant.info and ANNOVAR databases.

To create a csv file, just the filtered output is needed. To create an annotated vcf file, a tab indexed file (.tbi) file is needed (see comments in section Create unfiltered annotated vcf and csv files at the end of this page). This can be created using tabix.

First, the file needs to be compressed:

From the command line, running:

bgzip -c input_file.vcf > input_file.vcf.gz

returns input_vcf_file.vcf.gz

and running

tabix input_vcf_file.vcf.gz

will return: input_vcf_file.vcf.gz.tbi

Filter #1: specifying cancer-specifc rare variants

  • filter 1: ThousandGenomeAll < 0.05 or info not available
  • filter 2: ESP6500siv2_all < 0.05 or info not available
  • filter 3: cosmic70 information is present
  • filter 4: Func_knownGene is exonic, splicing, or both
  • filter 5: ExonicFunc_knownGene is not "synonymous SNV"
  • filter 6: Read Depth (DP) > 10

In [5]:
filepath = '/data/ccbb_internal/interns/Carlo'

In [9]:
#Create output files (if needed): specify name of files and path 
rare_cancer_variants_csv = filepath + "/tumor_rna_rare_cancer_vars_csv.csv"
rare_cancer_variants_vcf = filepath + "/tumor_rna_rare_cancer_vars_vcf.vcf"
input_vcf_compressed = filepath + '/test_vcf/Tumor_RNAseq_variants.vcf.gz'

#Apply filter.
filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_cancer_variants = filter_collection.rare_cancer_variant()

#Crete writer object for filtered lists:
my_writer = create_output_files.FileWriter(db_name, collection_name)

#cancer variants filtered files
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)
my_writer.generate_annotated_vcf(rare_cancer_variants,input_vcf_compressed, rare_cancer_variants_vcf)


Variants found that match rarity criteria: 11
Out[9]:
'Finished writing annotated, filtered VCF file'

Filter #2: specifying rare disease-specifc (rare) variants

  • filter 1: ThousandGenomeAll < 0.05 or info not available
  • filter 2: ESP6500siv2_all < 0.05 or info not available
  • filter 3: cosmic70 information is present
  • filter 4: Func_knownGene is exonic, splicing, or both
  • filter 5: ExonicFunc_knownGene is not "synonymous SNV"
  • filter 6: Read Depth (DP) > 10
  • filter 7: Clinvar data is present

In [13]:
#Apply filter.
filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_disease_variants = filter_collection.rare_disease_variant()


Variants found that match rarity criteria: 0

Zero variants found. Writing a csv output won't make much sense. You can still customize the filters the way you'd like, as you can see below.

Create your own filter

As long as you have a MongoDB instance running, filtering can be perfomed trough pymongo as shown by the code below. If a list is intended to be created from it, simply add: filter2 = list(filter2)

If you'd like to customize your filters, a good idea would be to look at the available fields to be filtered. Looking at the myvariant.info documentation, you can see what are all the fields avaialble and can be used for filtering.


In [ ]:
from pymongo import MongoClient

client = MongoClient()
db = client.My_Variant_Database
collection = db.ANNOVAR_MyVariant_chunks

filter2 = collection.find({ "$and": [
                                 {"$or": [{"ThousandGenomeAll": {"$lt": 0.05}}, {"ThousandGenomeAll": {"$exists": False}}]},
                                 {"$or": [{"ESP6500siv2_all": { "$lt": 0.05}}, {"ESP6500siv2_all": { "$exists": False}}]},
                                 {"$or": [{"Func_knownGene": "exonic"}, {"Func_knownGene": "splicing"}]},
                                 {"ExonicFunc_knownGene": {"$ne": "synonymous SNV"}},
                                 {"Genotype_Call.DP": {"$gte": 10}},
                                 {"cosmic70": { "$exists": True}}
                         ]})

Create unfiltered annotated vcf and csv files

Let's write an output file that contains all annotation data. This may be useful for researchers interested in obtaining a full description of their files.


In [15]:
#Create output files (if needed): specify name of files and path 
out_unfiltered_vcf_file = filepath + "/out_unfiltered_rnaseq_vcf.vcf"
out_unfiltered_csv_file = filepath + "/out_unfiltered_rnaseq_csv.csv"
input_vcf_compressed = filepath + '/test_vcf/Tumor_RNAseq_variants.vcf.gz'

#Create writer object
#db and collection name must be specified since no list is given. The entire collection will be queried. 
my_writer = create_output_files.FileWriter(db_name, collection_name)

#Write collection to csv and vcf
#The in_vcf_file must be the .vcf.gz file and it needs to have an associated .tbi file. 


my_writer.generate_unfiltered_annotated_csv(out_unfiltered_csv_file)
my_writer.generate_unfiltered_annotated_vcf(input_vcf_compressed, out_unfiltered_vcf_file)


Out[15]:
'Finished writing annotated VCF file'

Do some cool downstream analysis on your newly created datasets :)

...